ABSTRACT

We report on the Gaussian file search system designed as part of the
ChemXSeer digital library. Gaussian files are produced by the
Gaussian software [4], a package for calculating molecular electronic
structure and properties. The output files are semi-structured,
allowing relatively easy access to the Gaussian attributes and
metadata. Our system currently supports searching Gaussian documents
using a boolean combination of atoms (chemical elements) and
attributes. We have also implemented a faceted browsing feature on
three important Gaussian attribute types - Basis Set, Job Type and
Method Used. Faceted browsing enables a user to view and process a
smaller, filtered subset of documents.

Categories and Subject Descriptors 

H.3.7 [Information Storage and Retrieval]: Digital Libraries;

H.5.2 [Information Interfaces and Presentation]: User Interfaces -
graphical user interfaces (GUI), interaction styles, screen design,
user-centered design

General Terms 

Design, Documentation 

Keywords 

ChemXSeer, Gaussian software, Chemoinformatics, Faceted search 

1. INTRODUCTION

ChemXSeer is a digital library and data repository for the
Chemoinformatics and Computational Chemistry domains [8]. It
currently offers search functionalities on papers and formulae,
CHARMM calculation data and Gaussian computation data, and also
features a comprehensive search facility on chemical databases. A
table search functionality [7], similar in spirit to the one featured
in CiteSeerX^1, is currently under development. Gaussian document
search has been a key component of ChemXSeer from its inception. The
alpha version of Gaussian search featured a simple query box and an
SQL back-end. Here we describe the next generation of Gaussian
search^2, which includes a customized user interface for
Computational Chemistry researchers, boolean query functionality on a
pre-specified set of attributes, and a faceted browsing option over
three key attribute types. The current version of Gaussian search is
powered by Apache Solr^3, a state-of-the-art open-source enterprise
search engine indexer.

The organization of this paper is as follows. In Section 2 we give a
brief overview of the Gaussian software and Gaussian files,
emphasizing the need for a customized search interface rather than a
simple one. Description of the search interface appears in Section 3,
followed by a brief sketch of related work in Section 4. We conclude
in Section 5, outlining our contributions and providing directions
for future improvement.

2. GAUSSIAN FILES 

Computational chemists perform Gaussian calculations to determine
properties of a chemical system using a wide array of computational
methods. The methods include molecular mechanics, ground state
semi-empirical, self-consistent field, and density functional
calculations. Computational methods such as these are key to the
upsurge of interest in chemical calculations, partly because they
allow fast, reliable, and reasonably easy analysis, modeling, and
prediction of known and proposed systems (e.g., atoms, molecules,
solids, proposed drugs) under a wide range of physical constraints,
and partly because of the availability of well-tested, comprehensive
software packages like Gaussian that implement many of these methods
with a good tradeoff between accuracy and processing time.

^1 http://citeseerx.ist.psu.edu/
^2 http://cxs05.ist.psu.edu:8080/ChemXSeerGaussianSearch
^3 http://lucene.apache.org/solr/

Figure 1: Screenshot of a Gaussian document.

Figure 2: First-generation Gaussian query interface.

The Gaussian software is a suite of several different chemical
computation models, including packages for molecular mechanics,
Hartree-Fock methods, and semi-empirical calculations. While the
exact details of the functionalities of this software are beyond the
scope of this paper^4, we would like the reader to note that each run
of the Gaussian software is equivalent to conducting a chemical
experiment with certain inputs and under certain physico-chemical
conditions. The output of the software consists of a large amount of
information returned to the user via the computer console and usually
redirected to a suitably-named output file. We are interested in
these output files, henceforth referred to as "Gaussian files" or
"Gaussian documents".

The Gaussian files contain detailed information about the
calculations being performed on the system of interest. Although the
details of the calculations are essential for the analysis of the
system being studied, the output file can be cumbersome to a new
user. Each Gaussian file begins with the issued command that
initiated a particular calculation, followed by copyright
information, memory and hard disk specification, basis set, job type,
method used, and several different matrices (e.g., Z-matrix, distance
matrix, orientation matrix). It may also contain other information
like rotational constants, trust radius, maximum number of steps, and
steps in a particular run. Gaussian files are semi-structured
(Figure 1) in the sense that these parameters tend to appear in a
particular order or with explicit markups.
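To make the notion of "semi-structured" concrete, the following is a minimal sketch of how job type, method and basis set might be read from the command line that opens a Gaussian file. It assumes, for illustration only, a single-line command that starts with "#" and contains one "method/basis" token; real files may spread this command over several lines and use other forms. The function name parse_route and the keyword list JOB_KEYWORDS are hypothetical and are not the extractor used in ChemXSeer.

```python
import re

# Hypothetical, deliberately short list of job-type keywords.
JOB_KEYWORDS = {"sp", "opt", "freq", "irc", "ircmax", "force", "scan"}


def parse_route(line):
    """Split a simplified one-line '#' command into attributes.

    Returns None when the line is not a '#' command.
    """
    text = line.strip()
    if not text.startswith("#"):
        return None
    # Drop '#' and an optional print-level letter (n, p or t) that is
    # followed by whitespace, e.g. '#p opt ...'.
    body = re.sub(r"^#(?:[npt](?=\s))?\s*", "", text, flags=re.IGNORECASE)

    job_types, method, basis_set = [], None, None
    for token in body.lower().split():
        if "/" in token and method is None:
            # Assumed form: method/basis, e.g. b3lyp/6-31g(d)
            method, basis_set = token.split("/", 1)
            continue
        # Strip keyword options such as opt=(tight) or opt(tight).
        keyword = re.split(r"[=(]", token, maxsplit=1)[0]
        if keyword in JOB_KEYWORDS:
            job_types.append(keyword)

    # Assumption: a command without a job keyword is a single-point run.
    return {
        "job_types": job_types or ["sp"],
        "method": method,
        "basis_set": basis_set,
    }


if __name__ == "__main__":
    print(parse_route(" #p opt freq b3lyp/6-31g(d)"))
```

A production extractor would also need to handle multi-line commands and attributes that appear elsewhere in the file, which this sketch does not attempt.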

Since Gaussian files are important to the design, testing and
prediction of new chemical systems, ChemXSeer integrated a search
functionality on these files. The alpha version of the Gaussian
search interface consisted only of a simple query box (Figure 2), and
the back-end of the search engine was an SQL database that stored
data extracted from the Gaussian files. Although simple, the
interface allowed users to type in fielded queries and view results
in an easy-to-understand format. In the current version, we have
retained many aspects of the alpha version, including parts of the
search results page and the visual representation of individual
Gaussian files.

^4 For details, please see
http://www.gaussian.com/g_tech/g_ur/g09help.htm

Figure 3: Gaussian search system architecture.

However, our domain experts argued that a more complex interface
including faceted search was justified, partly because it eases the
task of a researcher by limiting the number of search results to
examine, and partly because such interfaces have already been
successfully implemented [9]. A computational chemist usually knows
what kinds of parameters they are looking for in a database of
Gaussian files, and therefore it makes sense to refine search results
using this information. We identified three important parameters
towards this end - Job Type, Method Used and Basis Set. There are
other parameters and metadata that we can extract from the Gaussian
files, but they are not as important from a domain expert's point of
view. These are Charge, Degree of Freedom, Distance Matrix, Energy,
Input Orientation, Mulliken Atomic Charge, Multiplicity, Optimized
Parameters, Frequencies, Thermo-chemistry, Thermal Energy, Shielding
Tensors, Reaction Path, PCM, and Variational Results. Metadata like
ID, Title and File Path are used in organizing the search results.

3. SYSTEM DESCRIPTION 

The basic query to the Gaussian search system is an atom (i.e.,
element) or a collection of atoms. The system returns all Gaussian
files containing those atoms. However, as experienced by researchers,
such basic queries often return a large number of search results,
many of which are not relevant. While we can think of improving the
ranking of search results in tune with traditional information
retrieval research, domain experts have informed us that since
Gaussian files are semi-structured, a faceted browsing option would
be more appropriate. It remains open, however, whether ranking within
each facet could be improved. Currently we rank the search results by
their external IDs, because our domain experts were not overly
concerned with the ranking.

Figure 4: Gaussian query interface.

Table 1: Gaussian Attribute Categories

Job Type        Method Used                Basis Set
Any             Any                        Any
Single Point    Semi-empirical             gen
Opt             Molecular Mechanics
Freq            Hartree-Fock
IRC             MP Methods
IRCMax          DFT Methods
Force           Multilevel Methods
ONIOM           CI Methods
ADMP            Coupled Cluster Methods
BOMD            CASSCF
Scan            BD
PBC             OVGF
SCRF            Huckel
NMR             Extended Huckel
                GVB
                CBS Methods

The system architecture is given in Figure 3. It has three principal
components - the query interface, the search results page and the
Gaussian file description page. The user supplies a query using the
query interface, consisting of atoms (mandatory field), method used,
job type and basis set. The last three fields are optional, and can
be combined in boolean AND/OR fashion. The boolean query goes to the
Gaussian document index, which in turn returns on the search results
page all Gaussian files satisfying the boolean query. The search
results page contains links to individual Gaussian file descriptions,
which in turn link to the actual Gaussian documents. Figure 3 also
indicates that the index was generated from Gaussian documents using
Apache Solr.
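As an illustration of the boolean query described above, the sketch below assembles a query string from the mandatory atoms and the three optional fields. The field names (atoms, method, job_type, basis_set) and the function build_query are hypothetical, and the way the atom clause is joined to the optional clause is our assumption; the actual index schema is not given in this paper.

```python
def build_query(atoms, method=None, job_type=None, basis_set=None, op="AND"):
    """Return a boolean query string in Lucene/Solr-style syntax.

    atoms is mandatory; the optional fields are joined with `op`.
    Field names are illustrative, not the real index schema.
    """
    if not atoms:
        raise ValueError("at least one atom is required")
    if op not in ("AND", "OR"):
        raise ValueError("op must be 'AND' or 'OR'")

    # Every selected atom must be present in the document.
    atom_clause = " AND ".join(f"atoms:{a}" for a in atoms)

    # Keep only the optional fields the user actually filled in.
    optional = [
        f'{field}:"{value}"'
        for field, value in (
            ("method", method),
            ("job_type", job_type),
            ("basis_set", basis_set),
        )
        if value
    ]
    if not optional:
        return atom_clause

    joined = f" {op} ".join(optional)
    return f"({atom_clause}) AND ({joined})"


if __name__ == "__main__":
    print(build_query(["C", "H"], method="hf", job_type="opt", op="OR"))
```

Values are quoted as phrases so that entries containing characters such as "*" or "(" are not read as query operators; escaping of embedded double quotes is left out for brevity.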

The lower section of Figure 3 explains the faceted browsing part.
Facets are created based on three attributes - job type, method used,
and basis set. Each facet link consists of an attribute, its value,
and the number of search results under the current set that satisfy
this value. The search results page contains links to different
values of the attributes. When the user clicks on such a link, a
refined query is sent to the Gaussian document index and the
resulting smaller set of search results is returned.
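The facet mechanism can be sketched in a few lines. The example below works on an in-memory list of result records rather than on the real index, and the field names and helper functions (facet_counts, refine) are hypothetical; it only shows how each facet link pairs an attribute value with a count, and how clicking a link narrows the result set.

```python
from collections import Counter

# Illustrative field names for the three faceted attributes.
FACET_FIELDS = ("job_type", "method", "basis_set")


def facet_counts(results):
    """Count, per facet field, how many results carry each value."""
    counts = {field: Counter() for field in FACET_FIELDS}
    for doc in results:
        for field in FACET_FIELDS:
            value = doc.get(field)
            if value is not None:
                counts[field][value] += 1
    return counts


def refine(results, field, value):
    """Keep only the results whose `field` equals `value`."""
    return [doc for doc in results if doc.get(field) == value]


if __name__ == "__main__":
    hits = [
        {"id": "g1", "job_type": "opt", "method": "hf", "basis_set": "gen"},
        {"id": "g2", "job_type": "opt", "method": "b3lyp", "basis_set": "gen"},
        {"id": "g3", "job_type": "freq", "method": "hf", "basis_set": "gen"},
    ]
    print(facet_counts(hits)["job_type"])
    print([d["id"] for d in refine(hits, "method", "hf")])
```

In a Solr deployment the same effect is normally obtained on the server side with the facet=true, facet.field and fq request parameters, so the counts and the refined set come back with the query response instead of being computed by the client.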

Figure 5: A search results page.

Figure 6: A Gaussian file description.

The implementation of our query interface (Figure 4) was inspired
partly by the EMSL Basis Set Exchange interface^5 and partly by the
requirements mentioned by our domain experts. Our interface features
a periodic table of elements, where users can click to select and
deselect each element (atom) individually. The selected elements
appear together in the textbox at the bottom of the table. Users can
specify whether they want search results that contain only the
selected elements - no more and no less - or search results that
contain the selected elements as well as other elements. After
selecting elements, users can optionally select Job Type, Method
Used, and Basis Set from the drop-down menus provided. They can also
directly type in the desired values for these attributes in the
textboxes. Finally, they can specify AND/OR from another drop-down
menu. The default option is AND. Fourteen Job Type categories
(values), sixteen Method Used categories, and two Basis Set
categories are provided in the drop-down menus. These categories are
given in Table 1. Each category has several sub-categories that are
dealt with by the search system. For example, if a user specifies
"Hartree-Fock" as the Method Used category, the system will search
for four sub-categories of Hartree-Fock - hf, rhf, rohf and uhf.
These sub-categories were specified by our domain experts. A sample
of Method Used sub-categories is given in Table 2, which shows the
sub-categories for three Method Used categories - Molecular
Mechanics, CI Methods, and CBS Methods. For the Basis Set attribute
there are many categories, but only two options are provided in the
drop-down menu to keep it short and simple. Users can type in the
category (e.g., 3-21G*) in the textbox provided.

^5 https://bse.pnl.gov/bse/portal
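Two details of the query interface lend themselves to a short sketch: the expansion of a Method Used category into its sub-categories, and the choice between "only the selected elements" and "the selected elements as well as others". The mapping below contains only values quoted in this paper (the Hartree-Fock sub-categories and the CBS Methods column of Table 2); the function names are hypothetical, and passing an unknown value through unchanged is our assumption about how typed-in values might be treated.

```python
# Partial category -> sub-category mapping, using values from the text.
SUBCATEGORIES = {
    "Hartree-Fock": ["hf", "rhf", "rohf", "uhf"],
    "CBS Methods": ["cbs-4m", "cbs-lq", "cbs-q", "cbs-qb3", "cbs-apno"],
}


def expand_method(category):
    """Return the sub-categories to search for a Method Used category.

    Values not in the mapping (e.g. typed into the textbox) are
    passed through unchanged; this fallback is an assumption.
    """
    return SUBCATEGORIES.get(category, [category])


def matches_elements(doc_atoms, selected, exact):
    """Check a document's atoms against the user's selection.

    exact=True  : document contains only the selected elements.
    exact=False : document contains the selected elements, and
                  possibly others.
    """
    doc_set, wanted = set(doc_atoms), set(selected)
    return doc_set == wanted if exact else wanted <= doc_set


if __name__ == "__main__":
    print(expand_method("Hartree-Fock"))
    print(matches_elements(["C", "H", "O"], ["C", "H"], exact=False))
    print(matches_elements(["C", "H", "O"], ["C", "H"], exact=True))
```

The expanded list would typically be OR-ed together inside the method clause of the boolean query, so that a search for "Hartree-Fock" matches a file recorded with any of the four sub-category values.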

Ten search results are shown in one search results page (Figure 5),
with the total number of results shown at the top. Note that the left
part of the search results page (Figure 5) contains links for faceted
browsing, and the right part contains the actual results. Each search
result consists of a link to the corresponding Gaussian file
description and a one-line summary of the file containing attribute
information. The Gaussian file description (Figure 6) consists of a
Jmol [5] rendering of the system being studied, followed by a summary
of the Gaussian job and information about attributes extracted from
the file. The summary contains a link to the Gaussian document
(Figure 6). Currently we have indexed 2148 documents.

Table 2: A sample of Method Used sub-categories

Molecular Mechanics    CI Methods    CBS Methods
amber                  cis           cbs-4m
dreiding               cis(d)        cbs-lq
uff                    cid           cbs-q
                       cisd          cbs-qb3
                       qcisd         cbs-apno
                       qcisd(t)
                       sac-ci

The faceted browsing section (left half of Figure 5) follows the
architectural specification of Figure 3. Users can refine search
results any time simply by clicking on a particular attribute
category. An "All Results" link has been provided to help users
quickly find the original set of results. Anecdotal evidence from our
domain experts suggests that the faceted browsing feature has been
able to significantly cut down on the number of search results to
examine, thereby saving a considerable amount of time on the part of
a Computational Chemistry researcher. Moreover, since each facet link
gives the number of search results to examine for a particular
attribute category, a user can readily obtain a visual appreciation
of the distribution of search results across different attribute
categories for a single query.

The core search and indexing functionality of Gaussian search is
currently provided by Apache Solr, an open-source state-of-the-art
enterprise search server designed to handle, among other things,
faceted search, boolean queries, and multivalued attributes. In our
case, atoms (chemical elements) in a Gaussian document comprise a
multivalued attribute. Each Gaussian document was converted by our
home-grown metadata extractor into an XML-style file suitable for
ingestion to Solr. The selection of Solr as the back-end platform for
this system was partly motivated by the need to integrate ChemXSeer
architecture with SeerSuite^6, a package of open-source software
tools that powers the CiteSeerX digital library.
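As a rough illustration of the conversion step, the sketch below writes one extracted record in the XML update format that Solr accepts (an add element wrapping doc and field elements), repeating the atoms field once per element to express a multivalued attribute. The field names and the function to_solr_xml are hypothetical; the actual schema and extractor output are not described in this paper.

```python
import xml.etree.ElementTree as ET


def to_solr_xml(doc_id, title, atoms, attributes):
    """Serialize one extracted record as a Solr <add><doc> message.

    atoms      : list of element symbols (multivalued field)
    attributes : dict of single-valued fields, e.g. method, job_type
    Field names are illustrative only.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def field(name, value):
        element = ET.SubElement(doc, "field", name=name)
        element.text = str(value)

    field("id", doc_id)
    field("title", title)
    # A multivalued attribute is expressed by repeating the field.
    for atom in atoms:
        field("atoms", atom)
    for name, value in attributes.items():
        field(name, value)

    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(to_solr_xml("g1", "water opt", ["H", "O"],
                      {"method": "b3lyp", "job_type": "opt"}))
```

Using ElementTree rather than string concatenation means that characters such as "<" or "&" in titles and attribute values are escaped automatically.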

4. RELATED WORK 

In this section we give a brief sketch of the related work. The
importance of using large databases to support chemistry calculations
has been illustrated by Feller in [3]. Schuchardt, et al., describe
such a database, the Basis Set Exchange [9]. Basis Set Exchange helps
users find particular basis sets that work on certain collections of
atoms, while ChemXSeer lets users search Gaussian files with basis
sets as boolean query components.

Among other purely chemistry-domain digital libraries, OREChem
ChemXSeer by Li, et al. [6] integrates semantic web technology with
the basic ChemXSeer framework. The Chemical Education Digital Library
[2] and the JCE (Journal of Chemical Education) Digital Library [1]
focus on organizing instructional and educational materials in
Chemistry. Both these projects are supported by NSF under the
National Science Digital Library (NSDL). In contrast with these
studies, our focus here is to design a search functionality on
Gaussian files that helps domain experts locate attribute information
more easily.

5. CONCLUSION 

In this paper our contributions are two-fold: 

• design of a new search engine for Computational Chemistry
research on documents produced by the widely used Gaussian
software, and

• design of a metadata extractor that sieves out several important
attributes from the Gaussian documents, and exports them into
Solr-ingestible XML format.

^6 http://sourceforge.net/projects/citeseerx/



Future work consists of integration of documents from the ChemXSeer
Digital Library with Gaussian files so that users can have an
integrated view of calculations, results, and analysis. The metadata
extractor could also be improved. There are a few cases where our
metadata extractor could not locate certain attribute values, mainly
due to the anomalous placement of those attributes in the Gaussian
output files. The structure of these documents appeared inconsistent
in certain places. Information extraction techniques may be useful
for handling these cases. Another area of potential research is
improving the ranking of search results. Although our domain experts
were not concerned with ranking, it remains to be seen if combining
attribute information can help pull up more relevant files earlier in
the ranking. Finally, Section 2 indicates the presence of several
other attributes in the Gaussian documents. It would be interesting
to explore whether these attributes are useful and can be leveraged
to produce additional relevant information.